Zürich Statistical Office collects data on the city and its residents. This data is published as Linked Data.
In this tutorial, we will show how to work with Linked Data. Mainly, we will see how to work with population dataset.
We will look into how to query, process, and visualize it.
Population data is published as Linked Data thatcan be accessed with SPARQL queries.
You can send queries using HTTP requests. The API endpoint is https://ld.stadt-zuerich.ch/query/.
Let's use SparqlClient from graphly to communicate with the database.
Graphly will allow us to:
pandas or geopandas# Installing dependencies for Colab environment
!pip install git+https://github.com/zazuko/graphly.git
Collecting git+https://github.com/zazuko/graphly.git
Cloning https://github.com/zazuko/graphly.git to /tmp/pip-req-build-buukchc8
Running command git clone -q https://github.com/zazuko/graphly.git /tmp/pip-req-build-buukchc8
Collecting geopandas>=0.9.0
Using cached geopandas-0.9.0-py2.py3-none-any.whl (994 kB)
Requirement already satisfied: matplotlib>=3.4.1 in /home/henschel/z/venv/lib/python3.8/site-packages (from graphly==0.1) (3.4.2)
Requirement already satisfied: networkx>=2.5.1 in /home/henschel/z/venv/lib/python3.8/site-packages (from graphly==0.1) (2.5.1)
Requirement already satisfied: pandas>=1.2.4 in /home/henschel/z/venv/lib/python3.8/site-packages (from graphly==0.1) (1.2.4)
Collecting requests>=2.25.1
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting pyproj>=2.2.0
Using cached pyproj-3.1.0-cp38-cp38-manylinux2010_x86_64.whl (6.6 MB)
Collecting shapely>=1.6
Using cached Shapely-1.7.1-cp38-cp38-manylinux1_x86_64.whl (1.0 MB)
Collecting fiona>=1.8
Using cached Fiona-1.8.20-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (15.4 MB)
Requirement already satisfied: cycler>=0.10 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (0.10.0)
Requirement already satisfied: pyparsing>=2.2.1 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (8.2.0)
Requirement already satisfied: numpy>=1.16 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (1.20.3)
Requirement already satisfied: python-dateutil>=2.7 in /home/henschel/z/venv/lib/python3.8/site-packages (from matplotlib>=3.4.1->graphly==0.1) (2.8.1)
Requirement already satisfied: decorator<5,>=4.3 in /home/henschel/z/venv/lib/python3.8/site-packages (from networkx>=2.5.1->graphly==0.1) (4.4.2)
Requirement already satisfied: pytz>=2017.3 in /home/henschel/z/venv/lib/python3.8/site-packages (from pandas>=1.2.4->graphly==0.1) (2021.1)
Collecting chardet<5,>=3.0.2
Downloading chardet-4.0.0-py2.py3-none-any.whl (178 kB)
|████████████████████████████████| 178 kB 3.0 MB/s eta 0:00:01
Collecting idna<3,>=2.5
Downloading idna-2.10-py2.py3-none-any.whl (58 kB)
|████████████████████████████████| 58 kB 488 kB/s eta 0:00:011
Collecting urllib3<1.27,>=1.21.1
Downloading urllib3-1.26.5-py2.py3-none-any.whl (138 kB)
|████████████████████████████████| 138 kB 3.0 MB/s eta 0:00:01
Collecting certifi>=2017.4.17
Downloading certifi-2021.5.30-py2.py3-none-any.whl (145 kB)
|████████████████████████████████| 145 kB 19.9 MB/s eta 0:00:01
Collecting munch
Using cached munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting cligj>=0.5
Using cached cligj-0.7.2-py3-none-any.whl (7.1 kB)
Collecting click-plugins>=1.0
Using cached click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Requirement already satisfied: six>=1.7 in /home/henschel/z/venv/lib/python3.8/site-packages (from fiona>=1.8->geopandas>=0.9.0->graphly==0.1) (1.16.0)
Requirement already satisfied: attrs>=17 in /home/henschel/z/venv/lib/python3.8/site-packages (from fiona>=1.8->geopandas>=0.9.0->graphly==0.1) (21.2.0)
Collecting click>=4.0
Downloading click-8.0.1-py3-none-any.whl (97 kB)
|████████████████████████████████| 97 kB 2.8 MB/s eta 0:00:01
Requirement already satisfied: setuptools in /home/henschel/z/venv/lib/python3.8/site-packages (from fiona>=1.8->geopandas>=0.9.0->graphly==0.1) (44.0.0)
Building wheels for collected packages: graphly
Building wheel for graphly (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/henschel/z/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-buukchc8/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-buukchc8/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-03f9ihui
cwd: /tmp/pip-req-build-buukchc8/
Complete output (6 lines):
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
ERROR: Failed building wheel for graphly
Running setup.py clean for graphly
Failed to build graphly
Installing collected packages: certifi, pyproj, shapely, munch, click, cligj, click-plugins, fiona, geopandas, chardet, idna, urllib3, requests, graphly
Running setup.py install for graphly ... done
Successfully installed certifi-2021.5.30 chardet-4.0.0 click-8.0.1 click-plugins-1.1.1 cligj-0.7.2 fiona-1.8.20 geopandas-0.9.0 graphly-0.1 idna-2.10 munch-2.5.0 pyproj-3.1.0 requests-2.25.1 shapely-1.7.1 urllib3-1.26.5
import re
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from graphly.api_client import SparqlClient
ENDPOINT = "https://ld.stadt-zuerich.ch/query/"
sparql = SparqlClient(ENDPOINT)
sparql.add_prefixes({
"schema": "<http://schema.org/>",
"cube": "<https://cube.link/>",
"property": "<https://ld.stadt-zuerich.ch/statistics/property/>",
"measure": "<https://ld.stadt-zuerich.ch/statistics/measure/>",
"skos": "<http://www.w3.org/2004/02/skos/core#>",
"ssz": "<https://ld.stadt-zuerich.ch/statistics/>"
})
SPARQL queries can become very long. To improve the readibility, we will work wih prefixes.
Using the add_prefixes method, we define persistent prefixes.
Every time you send a query, graphly will automatically add the prefixes for you.
Let's find the number of inhabitants in different parts of the city. The population data is available in the BEW data cube.
The query for the number of inhabitants in different city districts, over time looks as follows:
query = """
SELECT ?time ?place ?count
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BEW a cube:Cube;
cube:observationSet/cube:observation ?observation.
?observation property:RAUM ?place_uri ;
property:TIME ?time ;
measure:BEW ?count .
?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place .
FILTER regex(str(?place),"ab|Stadtgebiet vor")
}
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
| time | place | count | |
|---|---|---|---|
| 0 | 1408-12-31 | Kreis 1 (Stadtgebiet vor 1893) | 5675.0 |
| 1 | 1467-12-31 | Kreis 1 (Stadtgebiet vor 1893) | 4750.0 |
| 2 | 1529-12-31 | Kreis 1 (Stadtgebiet vor 1893) | 5080.0 |
| 3 | 1637-12-31 | Kreis 1 (Stadtgebiet vor 1893) | 8621.0 |
| 4 | 1671-12-31 | Kreis 1 (Stadtgebiet vor 1893) | 9590.0 |
Let's visualize the number of inhabitants per district. To do this, we will aggregate the numbers per place.
The cleaned dataframe becomes:
df.place = df.place.apply(lambda x: re.findall('Kreis \d+', x)[0])
df = pd.pivot_table(df, index="time", columns="place", values="count")
df.dropna(inplace=True)
df = df[df.columns[np.argsort(-df.iloc[0,])]]
df = df.reset_index().rename_axis(None, axis=1)
df.head()
| time | Kreis 11 | Kreis 3 | Kreis 9 | Kreis 7 | Kreis 6 | Kreis 10 | Kreis 12 | Kreis 2 | Kreis 4 | Kreis 8 | Kreis 5 | Kreis 1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1971-12-31 | 56863.0 | 52707.0 | 47257.0 | 39599.0 | 37837.0 | 36160.0 | 33664.0 | 32708.0 | 32231.0 | 20899.0 | 12833.0 | 9411.0 |
| 1 | 1972-12-31 | 56864.0 | 51674.0 | 47223.0 | 39118.0 | 37763.0 | 35760.0 | 33079.0 | 32561.0 | 31765.0 | 20371.0 | 12462.0 | 9007.0 |
| 2 | 1973-12-31 | 56464.0 | 50879.0 | 47215.0 | 38695.0 | 37059.0 | 35576.0 | 32201.0 | 31925.0 | 30906.0 | 19897.0 | 12235.0 | 8525.0 |
| 3 | 1974-12-31 | 56224.0 | 50175.0 | 47142.0 | 38045.0 | 36305.0 | 35449.0 | 31374.0 | 31706.0 | 30048.0 | 19552.0 | 12165.0 | 8076.0 |
| 4 | 1975-12-31 | 55627.0 | 49326.0 | 46491.0 | 37379.0 | 35294.0 | 35518.0 | 30943.0 | 31179.0 | 29061.0 | 19246.0 | 11798.0 | 7751.0 |
And now we can graph it using a line plot or a histogram.
fig = px.line(df, x="time", y = df.columns)
fig.update_layout(
title='Population in Zürich Districts',
title_x=0.5,
yaxis_title="inhabitants",
xaxis_title="Years",
legend_title="District"
)
fig.show("notebook")
fig = px.histogram(df, x="time", y=df.columns, barnorm="percent")
fig.update_layout(
title='Population in Zürich Districts',
title_x=0.5,
yaxis_title="% of inhabitants",
xaxis_title="Years",
legend_title="District"
)
fig.show()
Let's find the number of foreign and swiss inhabitants. The share of swiss/non-swiss population is available in the ANT-GGH-HEL data cube. The population count is available in BEW data cube.
The query for number of inhabitants and foreigners share over time looks as follows:
query = """
SELECT ?time (SUM(?pop_count) AS ?pop) (SUM(?foreigners_count)/SUM(?pop_count) AS ?foreigners)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BEW a cube:Cube;
cube:observationSet/cube:observation ?obs_bew.
?obs_bew property:TIME ?time ;
property:RAUM ?place_uri;
measure:BEW ?pop_count .
ssz:ANT-GGH-HEL a cube:Cube;
cube:observationSet/cube:observation ?obs_ant.
?obs_ant property:TIME ?time ;
property:RAUM ?place_uri;
measure:ANT ?ratio .
?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place .
BIND((?pop_count * ?ratio/100) AS ?foreigners_count)
}
GROUP BY ?time
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
| time | pop | foreigners | |
|---|---|---|---|
| 0 | 1934-12-31 | 317367.0 | 0.116519 |
| 1 | 1935-12-31 | 318981.0 | 0.110071 |
| 2 | 1936-12-31 | 319849.0 | 0.102763 |
| 3 | 1937-12-31 | 321380.0 | 0.097606 |
| 4 | 1938-12-31 | 329780.0 | 0.100623 |
And now lets visualize the data using absolute numbers as well as percentages.
fig = make_subplots(rows=2, cols=1)
fig.append_trace(go.Scatter(x=df["time"],y=df["pop"],
name="Total population",
marker_color=px.colors.qualitative.Vivid[7],
showlegend=False,
), row=1, col=1)
fig.append_trace(go.Bar(x=df["time"],y=1-df["foreigners"],
name="swiss",
marker_color=px.colors.qualitative.Vivid[3]
), row=2, col=1)
fig.append_trace(go.Bar(x=df["time"],y=df["foreigners"],
name="foreign",
marker_color=px.colors.qualitative.Vivid[9]
), row=2, col=1)
fig['layout']['yaxis']['title']='inhabitants'
fig['layout']['yaxis2']['title']='population share'
fig['layout']['yaxis2']['range']= [0,1]
fig.update_layout(height=800, title={"text": "Population in Zürich", "x": 0.5}, barmode = "stack",
legend = {"x": 1, "y": 0.37})
fig.show()
Let's find the number of inhabitants in different age groups. The population count per age group is available in the BEW-ALT-HEL-SEX data cube.
The query for the number of inhabitants in various age buckets over time looks as follows:
query = """
SELECT ?time ?age (SUM(?measure) AS ?count)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BEW-ALT-HEL-SEX a cube:Cube;
cube:observationSet/cube:observation ?observation.
?observation property:TIME ?time ;
property:ALT ?age_uri ;
measure:BEW ?measure .
?age_uri schema:name ?age .
FILTER (!REGEX(?age, "–", "i"))
}
GROUP BY ?time ?age
ORDER BY asc(?time)
"""
df = sparql.send_query(query)
df.head()
| time | age | count | |
|---|---|---|---|
| 0 | 2002-12-31 | 74 Jahre alt | 2994.0 |
| 1 | 2002-12-31 | 5 Jahre alt | 2779.0 |
| 2 | 2002-12-31 | 73 Jahre alt | 3046.0 |
| 3 | 2002-12-31 | 12 Jahre alt | 2729.0 |
| 4 | 2002-12-31 | 75 Jahre alt | 2855.0 |
Let's calculate the population share for each age group. The dataframe becomes:
df["year"] = df.time.apply(getattr, args=("year", ))
df["count"] = df.groupby(["year"]).transform(lambda x: (x/x.sum())*100)
df['age'] = df['age'].apply(lambda x: int(str(x.split(" ")[0])))
df = df.sort_values(by=["year", "age"]).reset_index(drop=True)
df.head()
| time | age | count | year | |
|---|---|---|---|---|
| 0 | 2002-12-31 | 0 | 0.923338 | 2002 |
| 1 | 2002-12-31 | 1 | 0.872695 | 2002 |
| 2 | 2002-12-31 | 2 | 0.872969 | 2002 |
| 3 | 2002-12-31 | 3 | 0.824516 | 2002 |
| 4 | 2002-12-31 | 4 | 0.779349 | 2002 |
And lets visualize it using an interactive plot.
fig = px.bar(df, x="age", y="count", animation_frame="year", range_y=[0, 3], range_x=[0, df.age.max()])
fig.update_layout(
title='Population Distribution',
title_x=0.5,
yaxis_title="Population Share in %",
xaxis_title="Age",
legend_title="District"
)
fig.show()
Let's take a look at age distribution among swiss and foreign inhabitants. We can find this data in the BEW-ALT-HEL-SEX data cube.
The query for number of inhabitants in various age buckets, with their origin, over time looks as follows:
query = """
SELECT ?age ?origin (SUM(?measure) AS ?count)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BEW-ALT-HEL-SEX a cube:Cube;
cube:observationSet/cube:observation ?observation.
?observation property:TIME ?time ;
property:ALT/schema:name ?age;
measure:BEW ?measure ;
property:HEL/schema:name ?origin .
FILTER (!REGEX(?age, "–", "i"))
FILTER (?time = "2017-12-31"^^xsd:date)
}
GROUP BY ?age ?origin
ORDER BY asc(?age)
"""
df = sparql.send_query(query)
df.head()
| age | origin | count | |
|---|---|---|---|
| 0 | 0 Jahre alt | Schweiz | 3316.0 |
| 1 | 0 Jahre alt | Ausland | 1659.0 |
| 2 | 1 Jahr alt | Ausland | 1613.0 |
| 3 | 1 Jahr alt | Schweiz | 3289.0 |
| 4 | 10 Jahre alt | Ausland | 800.0 |
Let's calculate the population share for each origin and age group. The dataframe becomes:
df["age"] = df["age"].apply(lambda x: int(str(x.split(" ")[0])))
df["count"] = df[["origin", "count"]].groupby(["origin"]).transform(lambda x: x/x.sum()*100)
df = df.sort_values(by=["age"]).reset_index(drop=True)
df.loc[df["origin"]=="Ausland", "origin"] = "foreign"
df.loc[df["origin"]=="Schweiz", "origin"] = "swiss"
df.head()
| age | origin | count | |
|---|---|---|---|
| 0 | 0 | swiss | 1.154552 |
| 1 | 0 | foreign | 1.210039 |
| 2 | 1 | foreign | 1.176488 |
| 3 | 1 | swiss | 1.145151 |
| 4 | 2 | foreign | 1.120326 |
fig = px.bar(df, x="age", y="count",
barmode="overlay", range_y = [0,4], color="origin")
fig.update_layout(
title='Population Distribution',
title_x=0.5,
yaxis_title="Population Share in %",
xaxis_title="Age",
legend_title="Origin"
)
fig.show()
Let's take a look at the age distribution for female and male inhabitants. We can find this data in the BEW-ALT-HEL-SEX data cube.
The query for number of inhabitants in various age buckets, with their sex, over time looks as follows:
query = """
SELECT ?time ?sex ?age (SUM(?measure) AS ?count)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BEW-ALT-HEL-SEX a cube:Cube;
cube:observationSet/cube:observation ?observation.
?observation property:TIME ?time ;
measure:BEW ?measure ;
property:SEX/schema:name ?sex ;
property:ALT/schema:name ?age.
FILTER (!REGEX(?age, "–", "i"))
}
GROUP BY ?time ?sex ?age
ORDER BY asc(?time)
"""
df = sparql.send_query(query)
df.head()
| time | sex | age | count | |
|---|---|---|---|---|
| 0 | 2002-12-31 | weiblich | 52 Jahre alt | 2160.0 |
| 1 | 2002-12-31 | weiblich | 0 Jahre alt | 1657.0 |
| 2 | 2002-12-31 | weiblich | 83 Jahre alt | 1108.0 |
| 3 | 2002-12-31 | männlich | 66 Jahre alt | 1338.0 |
| 4 | 2002-12-31 | weiblich | 19 Jahre alt | 1629.0 |
Let's create a dataframe where one row represents one observation. It will allow us to use violin plots for our dataframe.
The dataframe becomes:
df.loc[df["sex"]=="weiblich", "sex"] = "female"
df.loc[df["sex"]=="männlich", "sex"] = "male"
df['age'] = df['age'].apply(lambda x: str(x.split(" ")[0])).astype(int)
df["year"] = df.time.apply(getattr, args=("year", )).astype(str)
df = df.sort_values(by=["year", "age"]).reset_index(drop=True)
df = df[(df.year == df.year.max()) | ((df.year == df.year.min()))]
df = df[["sex", "age", "year"]].loc[df.index.repeat(df["count"])].reset_index(drop=True)
df.head()
| sex | age | year | |
|---|---|---|---|
| 0 | female | 0 | 2002 |
| 1 | female | 0 | 2002 |
| 2 | female | 0 | 2002 |
| 3 | female | 0 | 2002 |
| 4 | female | 0 | 2002 |
fig = px.violin(df, y="age", x="year", color="sex", violinmode="overlay")
fig.data[0].update(span = [0, 105], spanmode='manual')
fig.data[1].update(span = [0, 105], spanmode='manual')
fig.update_layout(title={"text": "Population distrubution", "x": 0.5})
fig.show()
fig = go.Figure()
fig.add_trace(go.Violin(x=df['sex'][df['year'] == "2002"],
y=df['age'][df['year'] == "2002"],
legendgroup='2002', scalegroup='2002', name='2002',
side='negative',
line_color='blue',
span = [0, 105],
spanmode='manual'))
fig.add_trace(go.Violin(x=df['sex'][df['year'] == "2017"],
y=df['age'][df['year'] == "2017"],
legendgroup='2017', scalegroup='2017', name='2017',
side='positive',
line_color='orange',
span = [0, 105],
spanmode='manual'))
fig.update_traces(meanline_visible=True)
fig.update_layout(title={"text": "Population distrubution", "x": 0.5})
fig.show()